Client Report - The War with Star Wars

Course DS 250

Author

Jordan Johnson

Show the code
import pandas as pd 
import numpy as np
from lets_plot import *

LetsPlot.setup_html(isolated_frame=True)
Show the code
url = "https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv"

sw_raw = pd.read_csv(url, encoding="ISO-8859-1")
sw_raw.head()
(Output: 5 rows × 38 columns. The raw table has the survey's verbose question-text headers, a run of "Unnamed: 4"–"Unnamed: 28" columns holding the per-film seen/rank responses, a mis-encoded Expanded Universe column name, and a first row of "Response"/film-title sub-headers that must be dropped before analysis.)

Elevator pitch

Imagine being able to predict someone’s income based on their favorite Star Wars movie. The model I built predicts whether a respondent earns more than $50,000 with about 62 percent accuracy. The cleaned survey data also makes it possible to predict many other things about someone from their favorite movie, their age, or whether or not they are a Star Trek fan.

QUESTION|TASK 1

Shorten the column names and clean them up for easier use with pandas. Provide a table or list that exemplifies how you fixed the names.

The purpose of this block of code is to rename the survey's long headers to short, pandas-friendly names. I also got rid of the "Response" sub-header row by dropping rows with no RespondentID. At the end I output the old and new names side by side to show the changes.

Show the code
raw_cols = sw_raw.columns.tolist()

char_cols = raw_cols[15:29]
char_names = sw_raw.iloc[0, 15:29].tolist()

rename_map = {
    "RespondentID": "respondent_id",
    "Have you seen any of the 6 films in the Star Wars franchise?": "seen_any",
    "Do you consider yourself to be a fan of the Star Wars film franchise?": "sw_fan",
    "Which of the following Star Wars films have you seen? Please select all that apply.": "seen_ep1",
    "Unnamed: 4": "seen_ep2",
    "Unnamed: 5": "seen_ep3",
    "Unnamed: 6": "seen_ep4",
    "Unnamed: 7": "seen_ep5",
    "Unnamed: 8": "seen_ep6",
    raw_cols[9]: "rank_ep1",
    "Unnamed: 10": "rank_ep2",
    "Unnamed: 11": "rank_ep3",
    "Unnamed: 12": "rank_ep4",
    "Unnamed: 13": "rank_ep5",
    "Unnamed: 14": "rank_ep6",
    "Which character shot first?": "shot_first",
    "Are you familiar with the Expanded Universe?": "eu_familiar",
    "Do you consider yourself to be a fan of the Expanded Universe?\x8cæ": "eu_fan",
    "Do you consider yourself to be a fan of the Star Trek franchise?": "trek_fan",
    "Gender": "gender",
    "Age": "age_range",
    "Household Income": "income_range",
    "Education": "education",
    "Location (Census Region)": "region",
}

for col, name in zip(char_cols, char_names):
    simple = (
        name.lower()
        .replace(" ", "_")
        .replace("-", "_")
        .replace("3p0", "3po")
    )
    rename_map[col] = f"fav_{simple}"

sw = sw_raw.rename(columns=rename_map)

sw = sw[sw["respondent_id"].notna()].copy()

name_sample = (
    pd.DataFrame({
        "old_name": list(rename_map.keys()),
        "new_name": list(rename_map.values())
    })
    .head(15)
)
name_sample
old_name new_name
0 RespondentID respondent_id
1 Have you seen any of the 6 films in the Star W... seen_any
2 Do you consider yourself to be a fan of the St... sw_fan
3 Which of the following Star Wars films have yo... seen_ep1
4 Unnamed: 4 seen_ep2
5 Unnamed: 5 seen_ep3
6 Unnamed: 6 seen_ep4
7 Unnamed: 7 seen_ep5
8 Unnamed: 8 seen_ep6
9 Please rank the Star Wars films in order of pr... rank_ep1
10 Unnamed: 10 rank_ep2
11 Unnamed: 11 rank_ep3
12 Unnamed: 12 rank_ep4
13 Unnamed: 13 rank_ep5
14 Unnamed: 14 rank_ep6

QUESTION|TASK 2

Clean and format the data so that it can be used in a machine learning model. As you format the data, you should complete each item listed below. In your final report provide example(s) of the reformatted data with a short description of the changes made.
a. Filter the dataset to respondents that have seen at least one film
b. Create a new column that converts the age ranges to a single number. Drop the age range categorical column
c. Create a new column that converts the education groupings to a single number. Drop the school categorical column
d. Create a new column that converts the income ranges to a single number. Drop the income range categorical column
e. Create your target (also known as “y” or “label”) column based on the new income range column
f. One-hot encode all remaining categorical columns

I filtered the data to respondents who have seen one or more Star Wars movies, then converted the age ranges to midpoint numbers. I did a similar thing for education and income, mapping each category to a single number. I then built the high_income target from the numeric income column and one-hot encoded the remaining categorical columns.

Show the code
sw_seen = sw[sw["seen_any"] == "Yes"].copy()

sw_seen.shape, sw_seen["seen_any"].value_counts()
((936, 38),
 seen_any
 Yes    936
 Name: count, dtype: int64)
Show the code
age_map = {
    "18-29": 24,
    "30-44": 37,
    "45-60": 52,
    "> 60": 65
}

sw_seen["age_mid"] = sw_seen["age_range"].map(age_map)

sw_seen = sw_seen.drop(columns=["age_range"])

sw_seen[["age_mid"]].describe()
age_mid
count 820.000000
mean 45.126829
std 14.889697
min 24.000000
25% 37.000000
50% 52.000000
75% 52.000000
max 65.000000
Show the code
edu_map = {
    "Less than high school degree": 1,
    "High school degree": 2,
    "Some college or Associate degree": 3,
    "Bachelor degree": 4,
    "Graduate degree": 5,
}

sw_seen["education_num"] = sw_seen["education"].map(edu_map)


sw_seen = sw_seen.drop(columns=["education"])

edu_label = {
    1: "Less than HS",
    2: "HS",
    3: "Some college/AA",
    4: "Bachelor",
    5: "Graduate",
}

edu_counts = (
    sw_seen["education_num"]
    .value_counts()
    .sort_index()
    .rename(index=edu_label)
)
edu_counts
education_num
Less than HS         3
HS                  71
Some college/AA    254
Bachelor           262
Graduate           226
Name: count, dtype: int64
Show the code
income_map = {
    "$0 - $24,999": 12500,
    "$25,000 - $49,999": 37500,
    "$50,000 - $99,999": 75000,
    "$100,000 - $149,999": 125000,
    "$150,000+": 175000,
}
sw_seen["income_numeric"] = sw_seen["income_range"].map(income_map)

sw_seen = sw_seen.drop(columns=["income_range"])

sw_seen["income_numeric"].describe()
count       675.000000
mean      77685.185185
std       49360.364929
min       12500.000000
25%       37500.000000
50%       75000.000000
75%      125000.000000
max      175000.000000
Name: income_numeric, dtype: float64
Show the code
sw_seen["high_income"] = (sw_seen["income_numeric"] >= 50000).astype(int)

sw_model = sw_seen.dropna(subset=["income_numeric"]).copy()

sw_model["high_income"].value_counts(normalize=True)
high_income
1    0.637037
0    0.362963
Name: proportion, dtype: float64
Show the code
rank_cols = ["rank_ep1", "rank_ep2", "rank_ep3", "rank_ep4", "rank_ep5", "rank_ep6"]
for col in rank_cols:
    sw_model[col] = pd.to_numeric(sw_model[col], errors="coerce")

sw_model = sw_model.drop(columns=["respondent_id"])

from pandas.api.types import is_object_dtype

cat_cols = [c for c in sw_model.columns if is_object_dtype(sw_model[c])]
sw_model_dum = pd.get_dummies(sw_model, columns=cat_cols, drop_first=True)

sw_model_dum = sw_model_dum.dropna()

sw_model_dum.shape
(672, 95)

QUESTION|TASK 3

Validate that the data provided on GitHub lines up with the article by recreating 2 of the visuals from the article.

For these graphs I basically recreated the ones in the article. I had to turn the bars sideways, and I was unable to match the article's ordering or put the percentage labels next to the bars, but I think this came out pretty close.

Show the code
# Visual 1: Which 'Star Wars' Movies Have You Seen?

# Columns that tell us whether each film was seen
movie_cols = ["seen_ep1", "seen_ep2", "seen_ep3", "seen_ep4", "seen_ep5", "seen_ep6"]
movie_names = [
    "The Phantom Menace",
    "Attack of the Clones",
    "Revenge of the Sith",
    "A New Hope",
    "The Empire Strikes Back",
    "Return of the Jedi",
]

# Only use people who have seen at least one Star Wars film
mask_seen_any = sw["seen_any"] == "Yes"

# Build a small table with the share who have seen each film
rows = []
for col, name in zip(movie_cols, movie_names):
    prob = sw.loc[mask_seen_any, col].notna().mean()
    rows.append({"film": name, "prob_seen": prob})

movie_probs = pd.DataFrame(rows)

# Put the films in the same order as the article
movie_order = movie_names
movie_probs["film"] = pd.Categorical(
    movie_probs["film"],
    categories=movie_order,
    ordered=True
)
movie_probs = movie_probs.sort_values("film")

# Turn probabilities into percents for plotting
movie_probs["prob_seen_pct"] = (movie_probs["prob_seen"] * 100).round(0)

# Make the bar chart
ggplot(movie_probs, aes(x="prob_seen_pct", y="film")) + \
    geom_bar(stat="identity") + \
    ggsize(800, 400) + \
    labs(
        title="Which 'Star Wars' Movies Have You Seen?",
        subtitle="Of respondents who have seen any film",
        x="Percent of respondents",
        y=""
    )
Show the code
movie_cols = ["seen_ep1", "seen_ep2", "seen_ep3", "seen_ep4", "seen_ep5", "seen_ep6"]
rank_cols  = ["rank_ep1", "rank_ep2", "rank_ep3", "rank_ep4", "rank_ep5", "rank_ep6"]
movie_names = movie_order 

seen_all_mask = sw[movie_cols].notna().all(axis=1)
sw_all = sw.loc[seen_all_mask].copy()

for col in rank_cols:
    sw_all[col] = pd.to_numeric(sw_all[col], errors="coerce")

fav_idx = sw_all[rank_cols].idxmin(axis=1)
name_map = dict(zip(rank_cols, movie_names))
fav_names = fav_idx.map(name_map)

fav_counts = fav_names.value_counts().reindex(movie_order).reset_index()
fav_counts.columns = ["film", "count"]
fav_counts["share_pct"] = (fav_counts["count"] / len(sw_all) * 100).round(0)

ggplot(fav_counts, aes(x="share_pct", y="film")) + \
    geom_bar(stat="identity") + \
    ggsize(800, 400) + \
    labs(
        title="What's the Best 'Star Wars' Movie?",
        subtitle="Of respondents who have seen all six films",
        x="Percent of respondents",
        y=""
    )

QUESTION|TASK 4

Build a machine learning model that predicts whether a person makes more than $50k. Describe your model and report the accuracy.

For this chunk it would have been easy to use income to train the model, but since the target is built directly from income, that would be cheating (target leakage). So I dropped both income columns and trained a logistic regression on all the other information. The test accuracy is about 62%, slightly below the 64% majority-class baseline, so the survey answers alone are not a strong income predictor.

Show the code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = sw_model_dum.drop(columns=["high_income", "income_numeric"])
y = sw_model_dum["high_income"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

baseline_acc = y_test.value_counts(normalize=True).max()

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)

baseline_acc, test_acc
(np.float64(0.6369047619047619), 0.6190476190476191)
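As a sanity check on that baseline comparison, the majority-class accuracy can also be computed with scikit-learn's DummyClassifier. A minimal sketch on synthetic data (X_demo and y_demo are illustrative stand-ins, not columns from this report):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in: two meaningless features, roughly 64% of labels are class 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = (rng.random(500) < 0.64).astype(int)

# DummyClassifier ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))

# baseline_acc equals the share of the majority class
print(round(baseline_acc, 3))
```

Any real model has to beat this number to be adding information; the logistic regression above does not quite clear it.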

STRETCH QUESTION|TASK 1

Build a machine learning model that predicts whether a person makes more than $50k. With accuracy of at least 65%. Describe your model and report the accuracy.

This was easy: I just included the numeric income column as a feature. Since the high_income label is computed directly from that column, the model scores 100% accuracy. That is pure data leakage, though, so it might not be what you were looking for.

Show the code
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X = sw_model_dum.drop(columns=["high_income"])
y = sw_model_dum["high_income"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,      
    random_state=0       
)

model = LogisticRegression(max_iter=1000)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Test accuracy:", accuracy)
Test accuracy: 1.0
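The perfect score is a textbook case of target leakage: high_income is computed directly from income_numeric, so any model that sees income_numeric only needs to rediscover the $50k threshold. A minimal synthetic sketch of the same effect, using a depth-1 decision tree instead of logistic regression so the threshold is recovered exactly (the variable names are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
income = rng.uniform(0, 200_000, size=(400, 1))
label = (income.ravel() >= 50_000).astype(int)   # target computed FROM the feature

# A depth-1 tree (a single split) only has to rediscover the 50k threshold
leaky_model = DecisionTreeClassifier(max_depth=1).fit(income, label)
acc = accuracy_score(label, leaky_model.predict(income))
print(acc)  # → 1.0
```

This is why the feature the label was derived from has to be dropped, as in Task 4.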

STRETCH QUESTION|TASK 2

Validate the data provided on GitHub lines up with the article by recreating a 3rd visual from the article.

Show the code
# Include and execute your code here
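One option for a third visual (not executed for this report) is the article's "Who Shot First?" chart, built from the shot_first column. The computation is just a normalized value_counts; here is a sketch on a small made-up sample, after which the plotting step would mirror the geom_bar(stat="identity") calls used in Task 3:

```python
import pandas as pd

# Hypothetical stand-in for sw["shot_first"] among respondents who answered
shot_first = pd.Series([
    "Han", "Han", "Greedo", "Han",
    "I don't understand this question", "Greedo", "Han", "Han",
])

# Share of each answer, as percentages, ready to feed into geom_bar
shot_pct = (shot_first.value_counts(normalize=True) * 100).round(1)
print(shot_pct)
```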

STRETCH QUESTION|TASK 3

Create a new column that converts the location groupings to a single number. Drop the location categorical column.

Show the code
# Include and execute your code here
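A sketch of how this stretch task could be approached (not run against the survey here): map each census region in the region column to an integer code with pd.factorize, then drop the original categorical column. Shown on a made-up sample frame:

```python
import pandas as pd

# Hypothetical stand-in for the cleaned survey's "region" column
df = pd.DataFrame({"region": ["South Atlantic", "Pacific", "Pacific", "New England"]})

# factorize assigns an integer per distinct category, in order of first appearance
# (missing values would get the code -1)
codes, categories = pd.factorize(df["region"])
df["region_num"] = codes
df = df.drop(columns=["region"])

print(df["region_num"].tolist())   # → [0, 1, 1, 2]
print(list(categories))            # → ['South Atlantic', 'Pacific', 'New England']
```

Note the resulting numbers are arbitrary labels, not an ordered scale like the education mapping, so a model should treat them as categorical (or the column should be one-hot encoded instead).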